Text style transfer aims to alter the style of a sentence while preserving its content. Due to the lack of parallel corpora, most recent work focuses on unsupervised methods and often uses cycle construction to train models. Since cycle construction helps to improve the style transfer ability of the model by rebuilding transferred sentences back to original-style sentences, it brings about a content loss in unsupervised text style transfer tasks. In this paper, we propose a novel disentanglement-based style transfer model StyleFlow to enhance content preservation. Instead of the typical encoder-decoder scheme, StyleFlow can not only conduct the forward process to obtain the output, but also infer to the input through the output. We design an attention-aware coupling layers to disentangle the content representations and the style representations of a sentence. Besides, we propose a data augmentation method based on Normalizing Flow to improve the robustness of the model. Experiment results demonstrate that our model preserves content effectively and achieves the state-of-the-art performance on the most metrics.
translated by 谷歌翻译
操纵可变形的线性对象(DLOS)在有障碍的受约束环境中实现所需的形状是一项有意义但具有挑战性的任务。对于这项高度约束的任务是必要的;但是,由于规划人员的可变形性质,计划人员需要的准确模型很难获得,并且不可避免的建模错误会显着影响计划结果,如果机器人只是以开环的方式执行计划的路径,则可能导致任务失败。在本文中,我们提出了一个粗到精细的框架,以结合全球计划和局部控制,以进行双臂操纵DLO,能够精确实现所需的配置并避免DLO,机器人和障碍物之间的潜在碰撞。具体而言,全球规划师是指一个简单而有效的DLO能量模型,并计算出一条粗略的途径,以确保任务的可行性。然后,本地控制器遵循该路径作为指导,并通过闭环反馈进一步塑造它,以补偿计划错误并保证任务的准确性。仿真和现实世界实验都表明,我们的框架可以在使用不精确的DLO模型的受约束环境中稳健地实现所需的DLO配置。仅通过计划或控制就无法可靠地实现。
translated by 谷歌翻译
可变形线性对象(DLOS)的机器人操纵在许多领域都具有广泛的应用前景。但是,一个关键问题是获得确切的变形模型(即机器人运动如何影响DLO变形),这些模型在不同的DLOS之间很难计算和变化。因此,DLOS的形状控制具有挑战性,尤其是对于需要全球和更准确模型的大型变形控制。在本文中,我们提出了一种离线和在线数据驱动的方法,用于有效地学习全球变形模型,从而可以通过离线学习进行准确的建模,并通过在线适应进行新的DLOS进行进一步更新。具体而言,由神经网络近似的模型首先是在随机数据的离线训练中,然后无缝迁移到在线阶段,并在实际操纵过程中进一步在线更新。引入了几种策略,以提高模型的效率和泛化能力。我们提出了一个基于凸优化的控制器,并使用Lyapunov方法分析系统的稳定性。详细的仿真和现实世界实验表明,我们的方法可以有效,精确地估计变形模型,并在2D和3D双臂操纵任务中对未经训练的DLO进行大型变形控制,而不是现有方法。它仅使用仿真数据进行离线学习来完成所有24个任务,并在现实世界中不同的DLO上具有不同的所需形状。
translated by 谷歌翻译
空间冗余广泛存在于视觉识别任务中,即图像或视频帧中的判别特征通常对应于像素的子集,而剩余区域与手头的任务无关。因此,在时间和空间消耗方面,处理具有相等计算量的所有像素的静态模型导致相当冗余。在本文中,我们将图像识别问题标准为顺序粗致细特征学习过程,模仿人类视觉系统。具体地,所提出的浏览和焦点网络(GFNET)首先以低分辨率比例提取输入图像的快速全局表示,然后策略性地参加一系列突出(小)区域以学习更精细的功能。顺序过程自然地促进了在测试时间的自适应推断,因为一旦模型对其预测充分信心,可以终止它,避免了进一步的冗余计算。值得注意的是,在我们模型中定位判别区域的问题被制定为增强学习任务,因此不需要除分类标签之外的其他手动注释。 GFNET是一般的,灵活,因为它与任何现成的骨干网型号(例如MobileCenets,Abservennet和TSM)兼容,可以方便地部署为特征提取器。对各种图像分类和视频识别任务的广泛实验以及各种骨干模型,证明了我们方法的显着效率。例如,它通过1.3倍降低了高效MobileNet-V3的平均等待时间,而不会牺牲精度。代码和预先训练的模型可在https://github.com/blackfeather-wang/gfnet-pytorch获得。
translated by 谷歌翻译
Benefiting from the intrinsic supervision information exploitation capability, contrastive learning has achieved promising performance in the field of deep graph clustering recently. However, we observe that two drawbacks of the positive and negative sample construction mechanisms limit the performance of existing algorithms from further improvement. 1) The quality of positive samples heavily depends on the carefully designed data augmentations, while inappropriate data augmentations would easily lead to the semantic drift and indiscriminative positive samples. 2) The constructed negative samples are not reliable for ignoring important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) by mining the intrinsic supervision information in the high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in two views. Moreover, to construct semantic meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function to pull close the samples from the same cluster while pushing away those from other clusters by maximizing and minimizing the cross-view cosine similarity between positive and negative samples. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with the existing state-of-the-art algorithms.
translated by 谷歌翻译
To generate high quality rendering images for real time applications, it is often to trace only a few samples-per-pixel (spp) at a lower resolution and then supersample to the high resolution. Based on the observation that the rendered pixels at a low resolution are typically highly aliased, we present a novel method for neural supersampling based on ray tracing 1/4-spp samples at the high resolution. Our key insight is that the ray-traced samples at the target resolution are accurate and reliable, which makes the supersampling an interpolation problem. We present a mask-reinforced neural network to reconstruct and interpolate high-quality image sequences. First, a novel temporal accumulation network is introduced to compute the correlation between current and previous features to significantly improve their temporal stability. Then a reconstruct network based on a multi-scale U-Net with skip connections is adopted for reconstruction and generation of the desired high-resolution image. Experimental results and comparisons have shown that our proposed method can generate higher quality results of supersampling, without increasing the total number of ray-tracing samples, over current state-of-the-art methods.
translated by 谷歌翻译
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
translated by 谷歌翻译
Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, which is called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.
translated by 谷歌翻译
Deploying reliable deep learning techniques in interdisciplinary applications needs learned models to output accurate and ({even more importantly}) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under an implicit assumption that faithful explanations come from accurate predictions/classifications. We have an opposite claim that explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in those neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network dubbed as NeuroExplainer, with applications to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, our NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and respective discriminative representations to accurately recognize preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximizes the explainability metrics (i.e., fidelity, sparsity, and stability) in network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer led to quantitatively reliable explanation results that are qualitatively consistent with representative neuroimaging studies.
translated by 谷歌翻译
Domain adaptation methods reduce domain shift typically by learning domain-invariant features. Most existing methods are built on distribution matching, e.g., adversarial domain adaptation, which tends to corrupt feature discriminability. In this paper, we propose Discriminative Radial Domain Adaptation (DRDR) which bridges source and target domains via a shared radial structure. It's motivated by the observation that as the model is trained to be progressively discriminative, features of different categories expand outwards in different directions, forming a radial structure. We show that transferring such an inherently discriminative structure would enable to enhance feature transferability and discriminability simultaneously. Specifically, we represent each domain with a global anchor and each category a local anchor to form a radial structure and reduce domain shift via structure matching. It consists of two parts, namely isometric transformation to align the structure globally and local refinement to match each category. To enhance the discriminability of the structure, we further encourage samples to cluster close to the corresponding local anchors based on optimal-transport assignment. Extensively experimenting on multiple benchmarks, our method is shown to consistently outperforms state-of-the-art approaches on varied tasks, including the typical unsupervised domain adaptation, multi-source domain adaptation, domain-agnostic learning, and domain generalization.
translated by 谷歌翻译